Goal: Demonstrate proficiency in text mining (sentiment analysis) using Python.

Synopsis:

The data package contains two files, each covering 212,404 unique books that were available on Amazon between 1996 and 2014. In addition, 142.8 million reviews spanning May 1996 to July 2014 have been collected. The two data files have the following structure:

Import the required modules.

Task 2: Data Cleansing, Classification, and Word Clouds

1. [5 Marks] Load the data into Python and perform the initial sanity checks.

1-1. Variable Analysis

Check the structure of the data and count the null values in each column.
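As a minimal sketch of this check, the snippet below builds a tiny stand-in frame and inspects it with pandas. The frame and the column names 'Title' and 'review/score' are assumptions for illustration; only 'review/text' is named in this notebook.

```python
import pandas as pd

# Tiny stand-in for the reviews file. 'Title' and 'review/score' are
# assumed column names; the real file may differ.
reviews = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C"],
    "review/score": [5.0, 2.0, None],
    "review/text": ["Loved it", None, "Too long"],
})

reviews.info()                         # dtypes and non-null counts per column
null_counts = reviews.isnull().sum()   # number of nulls per column
print(null_counts)
```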

1-2. Data Merge
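One possible way to combine the two files is a left join from reviews to book details. The sketch below assumes both files share a 'Title' key, which is a guess and not confirmed by this notebook; the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows; 'Title' as the join key is an assumption.
books = pd.DataFrame({
    "Title": ["Book A", "Book B"],
    "publishedDate": ["1999", "2005-03-01"],
})
reviews = pd.DataFrame({
    "Title": ["Book A", "Book A", "Book C"],
    "review/score": [5.0, 4.0, 1.0],
})

# Keep every review; indicator=True adds a '_merge' column that shows
# which reviews found no matching book record.
merged = reviews.merge(books, on="Title", how="left", indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print("reviews without book details:", unmatched)
```

A left join keeps the review count unchanged, so any unexpected growth in row count after merging would point to duplicate keys in the book file.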

1-3. Treating missing values
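The notebook does not state which strategy was used here. A common approach, sketched below under that caveat, is to drop rows missing the fields the model needs and fill the rest with a placeholder. Column names other than 'review/text' are assumed.

```python
import pandas as pd

# Made-up sample; 'review/score' and 'publishedDate' are assumed names.
df = pd.DataFrame({
    "review/score": [5.0, None, 1.0, 4.0],
    "review/text": ["Great", "Fine", None, "Good read"],
    "publishedDate": ["1999", None, "2005", None],
})

# Rows without a score or text cannot be used for sentiment modelling.
cleaned = df.dropna(subset=["review/score", "review/text"]).reset_index(drop=True)

# A missing publication date only affects one plot, so keep the row
# and mark the value instead of dropping it.
cleaned["publishedDate"] = cleaned["publishedDate"].fillna("unknown")
print(cleaned)
```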

1-4. Treating outliers
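Which variable was treated for outliers is not stated here. As one hypothetical example, the sketch below applies the usual 1.5 x IQR rule to review length; both the rule and the choice of variable are assumptions.

```python
import pandas as pd

# Made-up reviews whose lengths are 10, 12, 11, 13, 12 and 500 characters.
df = pd.DataFrame({"review/text": ["x" * n for n in [10, 12, 11, 13, 12, 500]]})

length = df["review/text"].str.len()
q1, q3 = length.quantile(0.25), length.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows whose length falls inside the IQR fences.
filtered = df[length.between(lower, upper)]
print(f"kept {len(filtered)} of {len(df)} rows")
```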

2. [10 Marks] Generate initial analytical visualizations to understand the reviews (e.g., a histogram), including a word cloud (using the wordcloud package). Discuss your findings.

2-1. Visualize what words are often included in 'review/text' using a word cloud.
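A word cloud is driven by word frequencies, so the counting step can be sketched with the standard library alone. The sample texts and the short stopword list below are illustrative assumptions.

```python
from collections import Counter
import re

# Made-up review texts and a deliberately small stopword list.
texts = [
    "This book is great",
    "Great story and great characters",
    "The story was slow",
]
stopwords = {"the", "a", "and", "is", "it", "this", "was", "of"}

# Lowercase, split into words, and drop stopwords before counting.
tokens = [
    word
    for text in texts
    for word in re.findall(r"[a-z']+", text.lower())
    if word not in stopwords
]
freqs = Counter(tokens)
print(freqs.most_common(3))
```

If the wordcloud package is installed, a mapping like `freqs` can be passed to `WordCloud().generate_from_frequencies(freqs)` to render the image.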

2-2. Display the number of reviews per review score.
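The counts behind such a chart can be computed as in the sketch below, which uses made-up scores and an assumed 'review/score' column name.

```python
import pandas as pd

# Made-up scores; 'review/score' is an assumed column name.
scores = pd.Series([5.0, 4.0, 5.0, 1.0, 3.0, 5.0, 4.0], name="review/score")

# Count reviews per score and order the result by score, not by count.
per_score = scores.value_counts().sort_index()
share = per_score / per_score.sum()   # class balance as proportions
print(per_score)
```

With matplotlib available, `per_score.plot(kind="bar")` would draw the bar chart. The proportions matter later because a heavy skew toward high scores affects how the classifiers should be evaluated.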

2-3. Number of characters in each 'review/text'.
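As a rough sketch, character counts can be derived with the pandas string accessor and then binned for a histogram. The sample texts and bin edges are arbitrary choices for illustration.

```python
import pandas as pd

texts = pd.Series(["Good", "Too long and dull", ""], name="review/text")

# Character count per review.
n_chars = texts.str.len()

# Bin the counts; include_lowest=True keeps zero-length reviews in the first bin.
binned = pd.cut(n_chars, bins=[0, 10, 100], include_lowest=True)
per_bin = binned.value_counts().sort_index()
print(n_chars.describe())
print(per_bin)
```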

2-4. Visualize the number of records per publication year.
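Publication dates in book metadata often mix formats such as year only, year-month, and full dates. The sketch below assumes a 'publishedDate' string column (a hypothetical name) and extracts the first four-digit year.

```python
import pandas as pd

# Made-up dates in mixed formats; 'publishedDate' is an assumed column name.
books = pd.DataFrame(
    {"publishedDate": ["1999", "2005-03-01", "1999-11", None, "2011-07-15"]}
)

# Pull out the first four-digit run; missing values stay missing.
year = books["publishedDate"].str.extract(r"(\d{4})", expand=False)
year = pd.to_numeric(year).dropna().astype(int)

per_year = year.value_counts().sort_index()
print(per_year)
```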

3. [5 Marks] Classify reviews into two ‘sentiment’ categories called positive and negative.
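The cut-off used for the two categories is not stated in this notebook. A common convention, assumed in the sketch below, treats scores of 4 and 5 as positive, 1 and 2 as negative, and drops neutral 3-star reviews.

```python
import pandas as pd

# Made-up scores; the threshold and the handling of score 3 are assumptions.
df = pd.DataFrame({"review/score": [5.0, 4.0, 3.0, 2.0, 1.0]})

# Drop neutral reviews, then label the rest.
df = df[df["review/score"] != 3].copy()
df["sentiment"] = (df["review/score"] >= 4).map(
    {True: "positive", False: "negative"}
)
print(df)
```

Dropping 3-star reviews sharpens the contrast between classes, at the cost of discarding data; keeping them in one class is an equally defensible alternative.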

4. [10 Marks] Generate positive and negative word clouds. Discuss your findings while comparing the positive and negative summaries (you may include other graphs if needed).

Using word clouds, we visualized the high-frequency words in high-scoring and low-scoring reviews.
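Because many frequent words appear in both clouds, it can help to look at words that are more frequent in one class than the other. The sketch below does this with `collections.Counter` on made-up labelled reviews; it is an illustration, not the analysis performed in this notebook.

```python
from collections import Counter
import re

# Made-up labelled reviews and a tiny stopword list.
reviews = [
    ("positive", "Great story, great characters"),
    ("positive", "A great read"),
    ("negative", "Boring story"),
    ("negative", "Boring and slow"),
]
stopwords = {"a", "and"}

counts = {"positive": Counter(), "negative": Counter()}
for sentiment, text in reviews:
    words = re.findall(r"[a-z']+", text.lower())
    counts[sentiment].update(w for w in words if w not in stopwords)

# Counter subtraction keeps only words whose count is higher in the
# first class, which removes words shared equally by both classes.
more_positive = counts["positive"] - counts["negative"]
more_negative = counts["negative"] - counts["positive"]
print(more_positive.most_common(3))
print(more_negative.most_common(3))
```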

Task 3: Prediction

1. [15 Marks] Build a simple logistic regression model to predict the sentiment category based on a text-based review. Discuss your findings.
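A minimal sketch of such a model is a scikit-learn pipeline that turns text into TF-IDF features and fits a logistic regression. The toy corpus below is made up, and TF-IDF is one reasonable choice of features; the features actually used in this notebook are not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up, perfectly balanced toy corpus.
texts = [
    "great book", "great story", "great read", "great characters",
    "boring book", "boring story", "boring read", "boring characters",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Vectorize the text and fit the classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

pred = model.predict(["a great story", "boring"])
print(list(pred))
```

On real data, the reviews would first be split with `train_test_split` so that the model is evaluated on reviews it has not seen.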

We can use scikit-learn's built-in classification report, which returns precision, recall, F1-score, and support (the number of true instances of each class). See the links and the accompanying figure for more detail on each of these metrics.
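To make the metrics concrete: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of the two. The sketch below runs `classification_report` on a handful of made-up labels.

```python
from sklearn.metrics import classification_report

# Made-up true and predicted labels for five reviews.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]

# Text table for reading, dict form for programmatic access.
print(classification_report(y_true, y_pred))
report = classification_report(y_true, y_pred, output_dict=True)
```

Here both predicted positives are correct (precision 1.0), but one of the three true positives is missed (recall 2/3).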

2. [15 Marks] Build a multinomial logistic regression model to predict the rating of a book based on its text-based review. Discuss your findings.
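For the rating task, the same pipeline idea extends to more than two classes. The sketch below uses a made-up corpus with only three rating levels for brevity and bag-of-words counts as an assumed feature choice. With the default lbfgs solver, recent scikit-learn versions fit a multinomial (softmax) model when the target has more than two classes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up reviews with ratings 1, 3 and 5 only (the real data has 1 to 5).
texts = [
    "terrible boring waste", "awful boring plot",
    "average okay story", "okay decent read",
    "wonderful great book", "great brilliant story",
]
ratings = [1, 1, 3, 3, 5, 5]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, ratings)

# One probability per rating class; each row sums to 1.
proba = model.predict_proba(["great wonderful book"])
predicted = model.predict(["great wonderful book"])[0]
print(model.classes_, proba.round(2), predicted)
```

Ratings are ordered, which a plain multinomial model ignores, so a confusion matrix is a useful complement to accuracy: mistaking a 4 for a 5 is less severe than mistaking a 1 for a 5.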

Fin.